Goal of this analysis and research design

In this analysis I fit annotation models to data collected to validate a measurement instrument of populism in textual data proposed by Hua, Abou-Chadi and Barberá.1 Specifically, I obtain estimates of the Bayesian beta-binomial by annotator (BBA) model proposed by Carpenter using MCMC methods.2 In a binary classification context, the BBA model estimates the prevalence of positive instances, coder-specific abilities in the form of sensitivity and specificity parameters, as well as items’ class membership (see below for a detailed discussion).

The goal of this exercise is to assess the following questions for each dimension of the measurement instrument (see below for a detailed discussion):

  1. What is the average posterior classification uncertainty when aggregating coders’ judgments at the item level?
  2. What is the variation in posterior classification uncertainty?
  3. What is the (posterior) distribution of coders’ abilities?

These questions are asked with an eye on the overarching goal to further improve and validate the measurement instrument proposed by Hua et al.

Specifically, I want to be able to implement coding experiments that allow me to answer the following questions (among others):

  1. How does the measurement quality achieved by aggregating crowd-sourced codings change as the number of codings aggregated per document is increased?
  2. What is the average number of codings per document required to achieve a predetermined level of measurement quality?
  3. How does the number of codings required per document to achieve a predetermined level of measurement quality vary across documents?
  4. Are model-based document-level labels biased relative to gold standard labels?
  5. Is the bias in model-based document-level labeling decreasing with the number of codings aggregated per document?
  6. How strong is the agreement between majority-winner with model-based labels?
  7. What is the agreement (i) between majority-winner labels and the gold standard, and (ii) between model-based labels and the gold standard?
  8. How does agreement among methods and their agreement with the gold standard, respectively, change as the number of codings aggregated per document is increased?

The quantities of interest are thus the changes in measurement quality metrics as the number of judgments aggregated per item, \(n_i\), is increased in integer steps. Measurement quality can be operationalized in four distinct ways:

  1. classification uncertainty, the corpus-level mean of and variation in the item-level standard deviation (across chains and iterations) of posterior classifications;
  2. accuracy, the corpus-level aggregate of model-based posterior classifications’ agreement with external gold standard labels at the item level;
  3. bias, the corpus-level ratio of false-positive to false-negative posterior classifications relative to external gold standard labels; and
  4. intercoder reliability, the intercoder agreement as measured by Krippendorff’s \(\alpha\) or Fleiss’s \(\kappa\).
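The first three operationalizations can be sketched directly in R. In the sketch below, the matrix `draws` and the vector `gold` are simulated placeholders standing in for the posterior class draws (iterations × items) and for gold-standard labels; all object names are illustrative:

```r
# Sketch of quality metrics (1)-(3); `draws` and `gold` are simulated
# placeholders for posterior class draws and gold-standard labels
set.seed(42)
n_items <- 100
p <- runif(n_items, .05, .95)
draws <- sapply(p, function(p_i) rbinom(2000, 1, p_i))  # 2000 x 100
gold <- rbinom(n_items, 1, .2)

# (1) classification uncertainty: corpus-level mean of and variation in
#     the item-level SD of posterior classifications
item_sd <- apply(draws, 2, sd)
uncertainty <- c(average = mean(item_sd), sd = sd(item_sd))

# (2) accuracy: agreement of posterior classifications with the gold standard
post_class <- as.integer(colMeans(draws) > .5)
accuracy <- mean(post_class == gold)

# (3) bias: ratio of false-positive to false-negative classifications
bias <- sum(post_class == 1 & gold == 0) / sum(post_class == 0 & gold == 1)
```

Metric (4) operates on the raw coding matrix rather than on posterior draws and is computed with Krippendorff’s \(\alpha\) or Fleiss’s \(\kappa\).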

Generally, we hypothesize the following changes as \(n_i\) is increased:

  1. classification uncertainty decreases,
  2. accuracy increases,
  3. bias decreases, and
  4. intercoder reliability increases.

To be able to scrutinize these hypotheses, it is imperative to first conduct sample size calculations. Sample size analyses are conducted to ascertain that the implemented experiments have enough statistical power to detect substantively relevant differences in these statistics, and generally require the following information: (i) estimates of the mean and variance of the metric in the population, (ii) the desired Type-I and Type-II error probabilities, and (iii) the difference sought to be detected.

However, except for item (ii), this information is not available: we neither know the averages and the variability of the quality metrics (item i), nor do we know a priori what magnitudes of change we can expect as \(n_i\) is increased (item iii).

In order to obtain reasonable bounds on these quantities, I thus pursue a two-pronged strategy:

  1. I will use the validation datasets collected by Hua et al. to obtain posterior estimates of the average and variance of these metrics in a setting where no item was judged by more than four coders, i.e., \(n_i = 1,\ \ldots,\ 4\).
  2. I will use this posterior knowledge to parameterize a simulation study designed to assess the change in quality metrics as a function of \(n_i\).

Implementing these two steps will then allow me to answer what magnitudes of change in quality metrics can be expected as \(n_i\) is increased, say from three to four, as well as the average values and variability of these metrics for each value of \(n_i\). These estimates can then be used to compute the sample sizes required to detect differences that are about the size of the changes observed in the simulation study.
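Once these quantities are available, the required number of items can be obtained with base R’s `power.t.test()`. The values below are placeholders, to be replaced by the posterior estimates from step 1 and the effect sizes from step 2:

```r
# Illustrative sample-size calculation; delta and sd are placeholder
# values, not estimates from the actual data
n_required <- power.t.test(
  delta = 0.02,         # smallest difference worth detecting
  sd = 0.12,            # (posterior) SD of the quality metric
  sig.level = 0.10,     # Type-I error probability
  power = 0.90,         # 1 - Type-II error probability
  type = "one.sample",
  alternative = "one.sided"
)
ceiling(n_required$n)   # number of items to collect
```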

The original measurement instrument

Hua et al. recruited crowd workers on the crowd-sourcing platform CrowdFlower to code social media posts created by a selected number of accounts of Western European parties and their leaders according to the following coding scheme:

  1. filter questions:
       • This post has no text or its content is impossible to understand (if applies, skip to next social media post)
       • I understand the message of this social media post (if applies, proceed with answering questions 2-4)
  2. anti-elitism: Does this tweet/post criticize or mention in a negative way the elites? (No/Yes)
  3. people-centrism: Does this tweet/post mention in a positive way or even praise the people (citizens of the country, the working class, the native …) or the nation? (No/Yes)
  4. exclusionism: Does this tweet/post criticize minorities or specific groups of people (muslims, jews, LGBT people, poor people …)? (No/Yes)

Fitting BBA models to crowd-sourced measurements

First, some general setup:

# set the file path
file_path <- file.path("~", "switchdrive", "Documents", "work", "phd", "methods", "crowd-sourced_annotation")

# load required namespaces

library(dplyr)
library(purrr)
library(tidyr)
library(rjags)
library(ggplot2)
library(ggridges)
library(icr)

# set seed
set.seed(1234)

# load internal helpers

helpers <- c(
  "compute_maxexpec.R"
  , "get_mcmc_estimates.R"
  , "get_codings_data.R"
  , "transform_betabin_fit_posthoc.R"
)

{sapply(file.path(file_path, "code", "R", helpers), source); NULL}
## NULL

The data

Next, we load and inspect the original validation datasets.

Number of items in the validation data by number of judgments received
No. Judgments No. Coders \(N\)
1 1 507
2 2 89
3 3 902
4 4 2

To obtain their validation data, Hua et al. crowd-sourced judgments from 1 different coders for a set of 1 different social media posts (items). Each item was coded between one and four times. For each item that was coded multiple times, no coder provided more than one judgment (i.e., there was no repeated coding).

The data were collected on the crowd-sourcing platform CrowdFlower and come with a set of filter and meta variables that need to be taken into account when constructing the datasets used to obtain posterior class estimates. The variable filter, for instance, has the following realizations in our data:

Number of judgments in validation data by filter type
Filter type \(N\)
ok 1489
notok 39
notok 4
Of all 3399 judgments, we want to retain only those of filter type ‘ok’. The following code implements this:
codings <- dat %>% 
  # keep only judgements that are 'ok'
  filter("ok" == filter) %>%
  # replace "" with NA in string vectors
  mutate_if(is.character, Vectorize(function(x) if (x == "") NA_character_ else x)) %>%
  mutate(
    # judgement index
    index = row_number(),
    # item index
    item = group_indices(., `_unit_id`) ,
    # coder index
    coder = group_indices(., `_worker_id`),
    # populism indicators
    elites = elites == "yes",
    exclusionary = exclusionary == "yes",
    people = people == "yes",
    populist = people & elites,
    right_populist = people & elites & exclusionary
  ) %>% 
  tbl_df()

n_judgments <- nrow(codings)
n_coders <- length(unique(codings$coder))
n_items <- length(unique(codings$item))

This leaves us with 1489 items judged by between one and four coders.

Model specification

We fit beta-binomial by annotator models to the judgments for each dimension separately.

All models share the same parametrization:

\[ \begin{align*} c_i &\sim\ \mbox{Bernoulli}(\pi)\\ \theta_{0j} &\sim\ \mbox{Beta}(\alpha_0 , \beta_0)\\ \theta_{1j} &\sim\ \mbox{Beta}(\alpha_1 , \beta_1)\\ y_{ij} &\sim\ \mbox{Bernoulli}(c_i\theta_{1j} + (1 - c_i)(1 - \theta_{0j}))\\ {}&{}\\ \pi &\sim\ \mbox{Beta}(1,1)\\ \alpha_0/(\alpha_0 + \beta_0) &\sim\ \mbox{Beta}(1,1)\\ \alpha_0+\beta_0 &\sim\ \mbox{Pareto}(1.5)\\ \alpha_1/(\alpha_1 + \beta_1) &\sim\ \mbox{Beta}(1,1)\\ \alpha_1+\beta_1 &\sim\ \mbox{Pareto}(1.5) \end{align*} \] where

  • \(c_i\) is the ‘true’ (unobserved) class of item \(i\),
  • \(\pi\) is the ‘true’ prevalence of the positive class,
  • \(\theta_{0j}\) is coder \(j\)’s specificity (true-negative rate),
  • \(\theta_{1j}\) is her sensitivity (true-positive rate), and
  • \(\alpha_\cdot, \beta_\cdot\) are the parameters of the Beta-distributions from which coders’ specificities and sensitivities are drawn (hyperpriors are parameterized in terms of their means and scales).

All priors are chosen to be uninformative, as we have no prior knowledge about coders’ abilities or the prevalence in this particular domain.
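The model file `beta-binomial_by_annotator.jags` itself is not reproduced in this document; a JAGS sketch consistent with the specification above could look as follows. The hyperpriors are placed on the means \(\alpha_\cdot/(\alpha_\cdot+\beta_\cdot)\) and scales \(\alpha_\cdot+\beta_\cdot\), as stated; the data names `item`, `coder`, `y` and the loop bounds are assumptions:

```
model {
  pi ~ dbeta(1, 1)

  # hyperpriors in mean/scale parameterization
  mu0 ~ dbeta(1, 1)
  s0  ~ dpar(1.5, 1)      # Pareto(1.5) on alpha0 + beta0
  mu1 ~ dbeta(1, 1)
  s1  ~ dpar(1.5, 1)      # Pareto(1.5) on alpha1 + beta1
  alpha0 <- mu0 * s0;  beta0 <- (1 - mu0) * s0
  alpha1 <- mu1 * s1;  beta1 <- (1 - mu1) * s1

  # coder abilities
  for (j in 1:m) {
    theta0[j] ~ dbeta(alpha0, beta0)  # specificity
    theta1[j] ~ dbeta(alpha1, beta1)  # sensitivity
  }

  # latent item classes
  for (i in 1:n) {
    c[i] ~ dbern(pi)
  }

  # observed judgments, one row per (item, coder) pair
  for (k in 1:N) {
    y[k] ~ dbern(c[item[k]] * theta1[coder[k]]
                 + (1 - c[item[k]]) * (1 - theta0[coder[k]]))
  }
}
```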

Before proceeding to estimating people-centrism in posts, some general JAGS setup:

# load DIC module
load.module("dic")

# global model parameters
n_chains <- 3
model_file_path <- file.path(file_path, "models", "beta-binomial_by_annotator.jags")
fit_file_path <- file.path(file_path, "fits", "betabinom_by_annotator_populism.RData")

People-centrism

We begin with the first dimension of the measurement instrument: people-centrism. In this context, the positive class unites posts that feature people-centrist statements. From studies in other domains (news articles, speeches), we expect the prevalence not to exceed 40%. With regard to coders’ abilities, we expect most coders to be non-adversarial (i.e., their judgments are not negatively correlated with item classes), as crowd workers were allowed to participate only if they successfully completed eight out of ten initial gold screening tasks. As these beliefs are not supported by domain-specific data, however, I decided to go with uninformative priors.

Estimation

I obtain MCMC estimates using JAGS with three chains, 5K burn-in iterations, and 40K iterations with thinning parameter set to 20. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.

# subset codings
people_codings <- codings %>%
  mutate(judgment = as.integer(people)) %>% 
  filter(!is.na(judgment)) %>% 
  select(index, item, coder, judgment) 

# construct model-compatible MCMC data object
people_mcmc_data <- get_codings_data(people_codings)

# initialization values
init_vals <- lapply(1:n_chains, function(chain) {
  
  out <- list()
  out[["pi"]] <- .2 + rnorm(1, 0, .05)
  out[[".RNG.name"]] <- "base::Wichmann-Hill"
  out[[".RNG.seed"]] <- 1234
  
  return(out)
})

# initialize model
people_mcmc_model <- jags.model(
  file = model_file_path
  , data = people_mcmc_data
  , inits = init_vals
  , n.chains = n_chains
)

# update: 5K burn-in iterations
update(people_mcmc_model, 5000)

# fit model
people_mcmc_fit <- coda.samples(
  people_mcmc_model
  , variable.names = c(
    "deviance"
    , "pi"
    , "c"
    , "theta0", "theta1"
    , "alpha0", "beta0"
    , "alpha1", "beta1"
  )
  , n.iter = 40000
  , thin = 20
)

fits <- list()
fits$people_mcmc_fit <- people_mcmc_fit

First, we want to ensure that the model converged and chains are well-mixed. To do so, we inspect convergence of the deviance information criterion:

Convergence is achieved very quickly: the shrinkage factor is close to one after only a few iterations. Keeping only every 20th estimate helps to reduce autocorrelation substantially.
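These convergence checks rest on the standard coda diagnostics; a self-contained sketch on a toy `mcmc.list` (in the actual analysis, `people_mcmc_fit` takes its place) is:

```r
# Convergence/autocorrelation diagnostics with coda; `toy_fit` is a
# simulated stand-in for the fitted mcmc.list object
library(coda)

set.seed(1)
toy_fit <- mcmc.list(lapply(1:3, function(chain) {
  mcmc(cbind(deviance = rnorm(500)))
}))

gelman.diag(toy_fit)    # potential scale reduction ('shrinkage') factor
autocorr.diag(toy_fit)  # autocorrelation at increasing lags
effectiveSize(toy_fit)  # effective number of posterior draws
```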

Posterior prevalence of people-centrism

Turning to the posterior density of \(\pi\), the prevalence of people-centrism in social media posts, it becomes immediately apparent that all three chains have converged on the reverse assignment, a problem resulting from the non-identifiability of the measurement model.3 Hence, I use post-hoc transformation to obtain the correct assignment.4
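The helper `transform_betabin_fit_posthoc()` is not shown here; schematically, such a correction reflects the reversed parameters around their midpoints and swaps the sensitivity and specificity blocks. A minimal sketch, assuming the draws are stored in a matrix with the monitored parameter names as column names:

```r
# Sketch of a post-hoc label-switching correction: c, pi, theta0, theta1
# are reflected around 0.5; theta0 <-> theta1 and (alpha0, beta0) <->
# (beta1, alpha1) are relabeled (cf. footnote 4)
flip_assignment <- function(draws) {
  vars <- colnames(draws)
  refl <- grepl("^(pi$|c\\[|theta0\\[|theta1\\[)", vars)
  draws[, refl] <- 1 - draws[, refl]
  map <- c(theta0 = "theta1", theta1 = "theta0",
           alpha0 = "beta1",  beta0  = "alpha1",
           alpha1 = "beta0",  beta1  = "alpha0")
  base <- sub("\\[.*$", "", vars)
  idx  <- sub("^[^[]*", "", vars)
  colnames(draws) <- ifelse(base %in% names(map),
                            paste0(map[base], idx), vars)
  draws[, order(colnames(draws)), drop = FALSE]
}
```

Columns not matched by the mapping (e.g., the deviance) pass through unchanged.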

The posterior density of \(\pi\) is unimodal and, after post-hoc transformation, has its mean at 0.107. Importantly, with a total of 3339 valid codings provided for 1489 items, the prevalence posterior density exhibits relatively little dispersion considering that we placed an uninformative (flat) Beta(1,1) prior density on the prevalence: 90% of posterior values lie in the range [0.078, 0.143].

Posterior classification uncertainty

Turning to posterior classification uncertainties, we see that with only a few judgments per item, there is much variability in classification uncertainty when aggregating classifications across chains and iterations at the item level:5

While the majority of items can be assigned with little posterior classification uncertainty, there are items with both moderate (\(\text{SD}(c_i) \in [.25, .4)\)) and high (\(\text{SD}(c_i) \geq .4\)) levels of posterior classification uncertainty.6

Hence, for people-centrism in the validation items, we get the following mean and standard deviation values of posterior classification uncertainty:

Posterior classification uncertainty in people-centrism classification
Average S.D.
0.167 0.118

Posterior coder abilities

In addition to classification quality, we are also interested in the distribution of coder abilities.

First, we can inspect posterior estimates of coders’ sensitivity and specificity parameters:

The picture is relatively homogeneous: coders are generally found to be highly specific, that is, to perform well in correctly classifying negative items. The mass of most posterior densities of the \(\theta_{0\cdot}\) parameters lies in the range \([.75,1)\). With regard to coders’ true-positive detection abilities, there are some outliers with sensitivities in the range \([.4, .6]\) (e.g., coders 7-9, 17, 38, and 39) and even substantial posterior density mass below .5 (specifically coder 32). Hence, the distribution of posterior means is more dispersed for sensitivities than for specificities:

Having specified uninformative priors, the validation data give reason to believe that the sampled coders are somewhat heterogeneous in terms of classification abilities, at least with regard to sensitivities. This is confirmed when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:

Only \(\beta_0\), the second shape parameter of the specificity hyperdistribution, can be estimated with comparatively high precision. Take, for instance, the shape parameters of coders’ sensitivities, \(\alpha_1, \beta_1\). 80% of their values lie in the ranges \(\alpha_1 \in\) [1.493, 12.435] and \(\beta_1 \in\) [0.678, 6.827]. Due to the flexibility of the Beta-distribution into which these hyperparameters feed, we get differently shaped posterior densities depending on the selected quantile values, as the next figure illustrates:

From the ability hyperdistributions we can conclude that the mass of coders is very specific and overwhelmingly non-adversarial, but less than perfect when classifying true-positive items.

(Dis)Agreement with majority-voting classifications

Most studies applying content analytical (i.e., human-coding based) instruments to measure populism in textual data use majority voting to aggregate codings at the item level. However, majority voting may produce biased results if coders provide noisy judgments. Hence, we want to know whether or not model-based posterior estimates and majority voting imply different classifications.
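For reference, majority voting with random tie-breaking, the aggregation rule these studies typically use, can be written as follows (the function name is illustrative):

```r
# Majority vote over an item's 0/1 judgments, breaking ties at random
majority_vote <- function(judgments) {
  share <- mean(judgments)
  if (share == 0.5) sample(0:1, 1) else as.integer(share > 0.5)
}

# per-item application with dplyr, using the `codings` data from above:
# codings %>%
#   group_by(item) %>%
#   summarise(mv_people = majority_vote(as.integer(people)))
```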

(Dis)Agreement of model-based posterior classification and majority voting for people-centrism classification.
Agree Posterior Classification \(n_i\) \(N\) Proportion
no 0 1 33 0.022
no 0 2 1 0.001
no 0 3 5 0.003
no 1 3 1 0.001
yes 0 1 446 0.300
yes 0 2 95 0.064
yes 0 3 788 0.529
yes 0 4 2 0.001
yes 1 1 34 0.023
yes 1 2 8 0.005
yes 1 3 76 0.051

Indeed, there are in total only 40 out of 1489 items (i.e., 2.686%) for which model-based and majority-voting classifications disagree. The vast share of this disagreement results from items that are classified as featuring people-centrism under majority voting but not under BBA model-based aggregation (39 items). Importantly, this disagreement occurs most often where only one coder judged an item.

As a consequence of these differences, the empirical prevalence (not to be confused with \(\pi\)) differs somewhat between classification methods: 0.105 in case of majority voting vs. 0.08 in case of model-based classification.

Anti-elitism

Estimation

I obtain MCMC estimates using JAGS with three chains, 5K burn-in iterations, and 15K iterations with thinning parameter set to 15. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.

Judging by the DIC, all chains mix nicely and converge quickly. With the thinning parameter set to 15, we effectively reduce autocorrelation to tolerable levels.

As in the case of people-centrism classification, however, the model converged on the inverse parameter assignment, so that we need to post-hoc transform estimates to the correct assignment.

Posterior prevalence of anti-elitism

The posterior density is unimodal and the mean of the prevalence is 0.282; that is, in expectation more than every fourth social media post generated by party leaders or party accounts features anti-elitism.

Posterior classification uncertainty

We see that with only a few judgments per item, there is much variability in classification uncertainty when aggregating across chains and iterations at the item level:

While about one third of items can be assigned with little posterior classification uncertainty, a substantial number of items is characterized by moderate to high levels of posterior classification uncertainty (\(\text{SD}(c_i) \geq .25\)). Hence, we have the following mean and standard deviation values of posterior classification uncertainty in anti-elitism classification:

Posterior classification uncertainty in anti-elitism classification
Average S.D.
0.211 0.142

Posterior coder abilities

Inspecting posterior estimates of coders’ sensitivity and specificity parameters, the picture is similar to that in the case of people-centrism classification: coders are generally highly specific, yet the sampled coders are more heterogeneous with regard to their abilities to correctly classify positive items, as is illustrated by the following figure:

Having specified uninformative priors, the validation data give reason to believe that the coder population is somewhat heterogeneous in terms of classification abilities, but more often than not non-adversarial and better than chance. This is confirmed when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:

Compared to the hyperparameter estimates in the case of people-centrism classification, densities are less dispersed, with the minor exception of \(\alpha_0\). Take, for instance, the shape parameters of coders’ sensitivities, \(\alpha_1, \beta_1\). 80% of their values lie in the ranges \(\alpha_1 \in\) [1.133, 3.81] and \(\beta_1 \in\) [0.394, 1.413]. Due to the flexibility of the Beta-distribution into which these hyperparameters feed, we get differently shaped posterior densities depending on the selected quantile values, as the next plot illustrates.

With above-median hyperparameter values, however, posterior ability distributions place the vast share of their mass on non-adversarial values (i.e., > .5), and again we have reason to believe that coders are both highly specific and, though somewhat less so, sensitive.

(Dis)Agreement with majority-voting classifications

Again, we want to know whether or not model-based posterior estimates and majority voting imply different classifications.

(Dis)Agreement of model-based posterior classification and majority voting for anti-elitism classification.
Agree Posterior Classification \(n_i\) \(N\) Proportion
no 0 3 5 0.003
no 1 3 24 0.016
yes 0 1 380 0.255
yes 0 2 89 0.060
yes 0 3 612 0.411
yes 1 1 133 0.089
yes 1 2 15 0.010
yes 1 3 229 0.154
yes 1 4 2 0.001

Indeed, there are in total only 29 out of 1489 items for which model-based and majority-voting classifications disagree. As a consequence of these differences, the empirical prevalence (not to be confused with \(\pi\)) differs only slightly between classification methods: 0.258 in case of majority voting vs. 0.271 in case of model-based classification.

Exclusionism

Estimation

I obtain MCMC estimates using JAGS with three chains, 10K burn-in iterations, and 100K iterations with thinning parameter set to 50. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.

Judging by the DIC, though chains tend to mix nicely, there is some downward drift in DIC values that only levels off after the first 50K iterations. Hence, the shrinkage factor approaches one only after some tens of thousands of iterations. What is more, with the thinning parameter set to 50, we still have substantial autocorrelation. The estimates obtained by fitting the BBA model to the exclusionism judgments are thus to be taken with a grain of salt.

Note also that the model again converged on the inverse parameter assignment, so that I post-hoc transformed estimates to the correct assignment.

Posterior prevalence of exclusionism

The posterior density is unimodal, the mean of the prevalence is 0.096, and 90% of posterior estimates lie in the range [0.073, 0.116].

Posterior classification uncertainty

We see that with only a few judgments per item, there is already relatively little classification uncertainty for most items:

The mass of items can be assigned with little posterior classification uncertainty, and there are only very few items with moderate to high levels of posterior classification uncertainty (\(\text{SD}(c_i) \geq .25\)). For exclusionism in the validation items, we get the following mean and standard deviation values of posterior classification uncertainty:

Posterior classification uncertainty in exclusionism classification
Average S.D.
0.116 0.077

Posterior coder abilities

In addition to classification quality, we are also interested in the distribution of coder abilities in exclusionism classification.

Inspecting posterior estimates of coders’ sensitivity and specificity parameters, we get a relatively clear-cut and familiar picture. Posterior estimates of both coders’ sensitivities and specificities are virtually all non-adversarial, and the mass of posterior densities lies in regions that indicate better-than-chance classification abilities. Again, coders are somewhat more heterogeneous with regard to true-positive classification abilities, as the following plot illustrates:

Given the validation data, we have reason to believe that the coder population may be somewhat heterogeneous in terms of true-positive classification abilities, whereas it is highly homogeneous in terms of true-negative classification abilities. This is supported when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:

With the minor exception of \(\beta_0\), which is characterized by a relatively tight credibility interval, densities are extremely dispersed. The resulting hyperdistributions, illustrated in the next figure, give reason to believe that the mass of the coder population is close to perfect in true-negative classification, and less perfect, more heterogeneous, but overwhelmingly non-adversarial and better than chance in true-positive classification.

(Dis)Agreement with majority-voting classifications

Again, we want to know whether or not model-based posterior estimates and majority voting imply different classifications.

(Dis)Agreement of model-based posterior classification and majority voting for exclusionism classification.
Agree Posterior Classification \(n_i\) \(N\) Proportion
no 1 3 30 0.020
yes 0 1 480 0.322
yes 0 2 101 0.068
yes 0 3 780 0.524
yes 0 4 1 0.001
yes 1 1 33 0.022
yes 1 2 3 0.002
yes 1 3 60 0.040
yes 1 4 1 0.001

Indeed, there are in total only 30 out of 1489 items for which model-based and majority-voting classifications disagree. All disagreement results from items that are classified as featuring exclusionism under BBA model-based aggregation but not under majority voting. As a consequence of random tie-breaking, however, the empirical prevalence still differs somewhat between classification methods: 0.065 in case of majority voting vs. 0.085 in case of model-based classification.

Summary

To sum up, we are now ready to answer the questions raised above about the means and standard deviations of posterior classification uncertainties.

  • In case of people-centrism classification, the average uncertainty in posterior classification, as measured by the item-level standard deviation in class assignments aggregated across chains and iterations, is 0.167 and its standard deviation is 0.118.
  • In case of anti-elitism classification, the average uncertainty in posterior classification is 0.211 and its standard deviation is 0.142.
  • Finally, in case of exclusionism classification, the average uncertainty in posterior classification is 0.116 and its standard deviation is 0.077.

Based on these data, we can then perform power analyses to compute the sample sizes required to detect selected levels of difference. The differences we are generally interested in are the changes in agreement or measurement quality metrics as the number of judgments aggregated per item, \(n_i\), is increased in integer steps.

With the exception of bias, which cannot be assessed here due to the lack of gold-standard labels, the hypotheses formulated at the outset of this analysis are all directional and thus demand one-tailed tests. Say we would want to be able to detect a decrease of 10% in average posterior classification uncertainty, that is, an average value of 0.15 instead of 0.167 in case of people-centrism classification, an average of 0.19 instead of 0.211 in case of anti-elitism classification, and an average of 0.104 instead of 0.116 in case of exclusionism classification. For a significance level of \(\alpha = .1\) and statistical power of \(1 - \beta = .9\), we would then need to let multiple coders judge at least 135, 121, and 117 items, respectively.

With regard to intercoder reliability metrics, we obtain the following statistics by generating 1000 bootstrapped estimates:

Bootstrapped statistics of intercoder reliability metrics: Krippendorff’s \(\alpha\)
Dimension Average S.D.
People-centrism 0.301 0.032
Anti-elitism 0.429 0.021
Exclusionism 0.625 0.033

Say we want to be able to detect increases in intercoder reliability of 2.5%. Then we need to collect judgments for 50 items for people-centrism, 11 for anti-elitism, and 13 for exclusionism classification, respectively.
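The bootstrap itself relies on the icr package; for illustration, nominal Krippendorff’s \(\alpha\) and an item-level bootstrap can also be sketched in base R. Here `dat` is an items × coders matrix with `NA` for missing judgments, and all names are illustrative:

```r
# Nominal Krippendorff's alpha from the coincidence matrix; a base-R
# sketch (the analysis itself uses the icr package)
kripp_alpha <- function(dat) {
  vals <- sort(unique(na.omit(as.vector(dat))))
  o <- matrix(0, length(vals), length(vals), dimnames = list(vals, vals))
  for (u in seq_len(nrow(dat))) {
    x <- na.omit(dat[u, ])
    m <- length(x)
    if (m < 2) next  # units with a single judgment carry no information
    for (g in seq_len(m)) for (h in seq_len(m)) if (g != h) {
      a <- as.character(x[g]); b <- as.character(x[h])
      o[a, b] <- o[a, b] + 1 / (m - 1)
    }
  }
  n_c <- rowSums(o); n <- sum(n_c)
  d_obs <- sum(o) - sum(diag(o))                          # observed disagreement
  d_exp <- (sum(outer(n_c, n_c)) - sum(n_c^2)) / (n - 1)  # expected disagreement
  1 - d_obs / d_exp
}

# item-level bootstrap: resample items with replacement B times
boot_alpha <- function(dat, B = 1000) {
  replicate(B, kripp_alpha(dat[sample(nrow(dat), replace = TRUE), ,
                               drop = FALSE]))
}
```

For two coders judging three items as (0,0), (1,1), and (0,1), this yields \(\alpha = 4/9 \approx 0.44\).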

Simulation study

[to be added]

  1. Hua, Whitney, Tarik Abou-Chadi, and Pablo Barberá. “Networked Populism: Characterizing the Public Rhetoric of Populist Parties in Europe.” 2018. Paper prepared for the 2018 EPSA Conference.

  2. Carpenter, Bob. “Multilevel Bayesian Models of Categorical Data Annotation.” 2008. Unpublished manuscript.

  3. Carpenter (2008, 7)

  4. All chains converge on the ‘reversed’ parameter assignment of \(\mathcal{P} = \left(\{c_i\}_{i\in 1, \ldots, n}, \pi, \{\theta_{j0}\}_{j\in\,1, \ldots, m}, \{\theta_{j1}\}_{j\in\,1, \ldots, m}, \alpha_0, \beta_0, \alpha_1, \beta_1 \right)\): \(\mathcal{P}' = \left( \{1-c_i\}_{i\in 1, \ldots, n}, 1-\pi, \{1-\theta_{j1}\}_{j\in\,1, \ldots, m}, \{1-\theta_{j0}\}_{j\in\,1, \ldots, m}, \beta_1, \alpha_1, \beta_0, \alpha_0 \right)\). In the reversed assignment \(\mathcal{P}'\), \(c_i' = 1 - c_i\), the prevalence is reflected around 0.5, and the sensitivity and specificity parameters are swapped and reflected around 0.5 (Carpenter, 2008, 7f.).

  5. Here, posterior classification uncertainty at the item level is measured as the standard deviation of posterior classifications across chains and iterations.

  6. In binary classification, the theoretical maximum value of posterior classification uncertainty is achieved when an item is assigned to the positive class in 50% of iterations and to the negative class otherwise.